Snapshot arbitration techniques for memory requests

ABSTRACT

Techniques are disclosed relating to arbitration for computer memory resources. In some embodiments, an apparatus includes queue circuitry that implements multiple queues configured to queue requests to access a memory bus. Control circuitry may, in response to detecting a first threshold condition associated with the queue circuitry, generate a first snapshot that indicates numbers of requests in respective queues of the multiple queues at a first time. The control circuitry may generate a second snapshot that indicates numbers of requests in respective queues of the multiple queues at a second time that is subsequent to the first time. The control circuitry may arbitrate between requests from the multiple queues to select requests to access the memory bus, where the arbitration is based on snapshots to which requests from the multiple queues belong. Disclosed techniques may approximate age-based scheduling while reducing area and power consumption.

BACKGROUND

Technical Field

This disclosure relates to arbitration for computer memory resources.

Description of the Related Art

Computer memory is typically available to multiple requesting agents via one or more channels. Control circuitry may arbitrate among requests to grant access to a particular memory. Some types of arbitration such as round-robin selection may be inexpensive in terms of area and power consumption but may provide poor performance for clients with larger numbers of requests. Other types of arbitration such as age-based schemes may provide more fairness among requesters by selecting requests in the order they were submitted, but may be relatively expensive in terms of area and power consumption.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a diagram illustrating an overview of example graphics processing operations, according to some embodiments.

FIG. 1B is a block diagram illustrating an example graphics unit, according to some embodiments.

FIG. 2 is a block diagram illustrating example snapshot control and arbitration circuitry, according to some embodiments.

FIG. 3 is a diagram illustrating example counter-based snapshot states at different points in time, according to some embodiments.

FIG. 4 is a flow diagram illustrating an example snapshot-based arbitration technique, according to some embodiments.

FIG. 5 is a block diagram illustrating an example split cache architecture, according to some embodiments.

FIG. 6 is a block diagram illustrating router circuitry that may incorporate snapshot-based arbitration, according to some embodiments.

FIG. 7 is a block diagram illustrating example router input circuitry, according to some embodiments.

FIG. 8 is a block diagram illustrating example router output circuitry, according to some embodiments.

FIG. 9 is a block diagram illustrating example snapshot override circuitry, according to some embodiments.

FIG. 10 is a flow diagram illustrating an example method, according to some embodiments.

FIG. 11 is a block diagram illustrating an example computing device, according to some embodiments.

FIG. 12 is a diagram illustrating example applications of disclosed systems and devices, according to some embodiments.

FIG. 13 is a block diagram illustrating an example computer-readable medium that stores circuit design information, according to some embodiments.

DETAILED DESCRIPTION

In disclosed embodiments, snapshot-based arbitration may provide similar desirable arbitration performance to timestamp arbitration in certain scenarios, but with reduced circuitry complexity and power consumption. FIGS. 1A-1B provide an overview of graphics processing, which may utilize disclosed arbitration techniques (although these techniques may also be implemented in other scenarios such as CPUs or memory management units). FIGS. 2-4 provide examples of snapshot-based arbitration techniques. FIGS. 5-8 provide example routing circuitry for a split cache architecture that may utilize disclosed arbitration techniques. FIG. 9 shows circuitry configured to override an oldest snapshot in certain scenarios. FIGS. 10-13 provide example methods, devices, systems, and computer-readable media.

Graphics Processing Overview

Referring to FIG. 1A, a flow diagram illustrating an example processing flow 100 for processing graphics data is shown. In some embodiments, transform and lighting procedure 110 may involve processing lighting information for vertices received from an application based on defined light source locations, reflectance, etc., assembling the vertices into polygons (e.g., triangles), and transforming the polygons to the correct size and orientation based on position in a three-dimensional space. Clip procedure 115 may involve discarding polygons or vertices that fall outside of a viewable area. Rasterize procedure 120 may involve defining fragments within each polygon and assigning initial color values for each fragment, e.g., based on texture coordinates of the vertices of the polygon. Fragments may specify attributes for pixels which they overlap, but the actual pixel attributes may be determined based on combining multiple fragments (e.g., in a frame buffer), ignoring one or more fragments (e.g., if they are covered by other objects), or both. Shade procedure 130 may involve altering pixel components based on lighting, shadows, bump mapping, translucency, etc. Shaded pixels may be assembled in a frame buffer 135. Modern GPUs typically include programmable shaders that allow customization of shading and other processing procedures by application developers. Thus, in various embodiments, the example elements of FIG. 1A may be performed in various orders, performed in parallel, or omitted. Additional processing procedures may also be implemented.

Referring now to FIG. 1B, a simplified block diagram illustrating a graphics unit 150 is shown, according to some embodiments. In the illustrated embodiment, graphics unit 150 includes programmable shader 160, vertex pipe 185, fragment pipe 175, texture processing unit (TPU) 165, image write unit 170, and memory interface 180. In some embodiments, graphics unit 150 is configured to process both vertex and fragment data using programmable shader 160, which may be configured to process graphics data in parallel using multiple execution pipelines or instances.

Vertex pipe 185, in the illustrated embodiment, may include various fixed-function hardware configured to process vertex data. Vertex pipe 185 may be configured to communicate with programmable shader 160 in order to coordinate vertex processing. In the illustrated embodiment, vertex pipe 185 is configured to send processed data to fragment pipe 175 or programmable shader 160 for further processing.

Fragment pipe 175, in the illustrated embodiment, may include various fixed-function hardware configured to process pixel data. Fragment pipe 175 may be configured to communicate with programmable shader 160 in order to coordinate fragment processing. Fragment pipe 175 may be configured to perform rasterization on polygons from vertex pipe 185 or programmable shader 160 to generate fragment data. Vertex pipe 185 and fragment pipe 175 may be coupled to memory interface 180 (coupling not shown) in order to access graphics data.

Programmable shader 160, in the illustrated embodiment, is configured to receive vertex data from vertex pipe 185 and fragment data from fragment pipe 175 and TPU 165. Programmable shader 160 may be configured to perform vertex processing tasks on vertex data which may include various transformations and adjustments of vertex data. Programmable shader 160, in the illustrated embodiment, is also configured to perform fragment processing tasks on pixel data such as texturing and shading, for example. Programmable shader 160 may include multiple sets of multiple execution pipelines for processing data in parallel.

In some embodiments, programmable shader 160 includes pipelines configured to execute one or more different SIMD groups in parallel. Each pipeline may include various stages configured to perform operations in a given clock cycle, such as fetch, decode, issue, execute, etc. The concept of a processor “pipeline” is well understood, and refers to the concept of splitting the “work” a processor performs on instructions into multiple stages. In some embodiments, instruction decode, dispatch, execution (i.e., performance), and retirement may be examples of different pipeline stages. Many different pipeline architectures are possible with varying orderings of elements/portions. Various pipeline stages perform such steps on an instruction during one or more processor clock cycles, then pass the instruction or operations associated with the instruction on to other stages for further processing.

The term “SIMD group” is intended to be interpreted according to its well-understood meaning, which includes a set of threads for which processing hardware processes the same instruction in parallel using different input data for the different threads. Various types of computer processors may include sets of pipelines configured to execute SIMD instructions. For example, graphics processors often include programmable shader cores that are configured to execute instructions for a set of related threads in a SIMD fashion. Other examples of names that may be used for a SIMD group include: a wavefront, a clique, or a warp. A SIMD group may be a part of a larger thread group, which may be broken up into a number of SIMD groups based on the parallel processing capabilities of a computer. In some embodiments, each thread is assigned to a hardware pipeline that fetches operands for that thread and performs the specified operations in parallel with other pipelines for the set of threads. Note that processors may have a large number of pipelines such that multiple separate SIMD groups may also execute in parallel. In some embodiments, each thread has private operand storage, e.g., in a register file. Thus, a read of a particular register from the register file may provide the version of the register for each thread in a SIMD group.

In some embodiments, multiple programmable shader units 160 are included in a GPU. In these embodiments, global control circuitry may assign work to the different sub-portions of the GPU which may in turn assign work to shader cores to be processed by shader pipelines.

TPU 165, in the illustrated embodiment, is configured to schedule fragment processing tasks from programmable shader 160. In some embodiments, TPU 165 is configured to pre-fetch texture data and assign initial colors to fragments for further processing by programmable shader 160 (e.g., via memory interface 180). TPU 165 may be configured to provide fragment components in normalized integer formats or floating-point formats, for example. In some embodiments, TPU 165 is configured to provide fragments in groups of four (a “fragment quad”) in a 2×2 format to be processed by a group of four execution pipelines in programmable shader 160.

Image write unit (IWU) 170, in some embodiments, is configured to store processed tiles of an image and may perform operations to a rendered image before it is transferred for display or to memory for storage. In some embodiments, graphics unit 150 is configured to perform tile-based deferred rendering (TBDR). In tile-based rendering, different portions of the screen space (e.g., squares or rectangles of pixels) may be processed separately. Memory interface 180 may facilitate communications with one or more of various memory hierarchies in various embodiments.

Overview of Snapshot-Based Arbitration

FIG. 2 is a block diagram illustrating example circuitry that supports snapshot arbitration, according to some embodiments. In the illustrated embodiment, circuitry includes request interface 210, queues 220A-220N, snapshot control circuitry 230, and arbitration circuitry 240.

Request interface 210, in the illustrated embodiment, is configured to receive requests from multiple different channels and assign requests to corresponding queues 220. The channels may be different physical channels or may be virtual channels that share underlying physical buses. In some embodiments, request interface 210 receives router data from multiple directions and regroups requests by channel, as discussed in detail below with reference to FIG. 8. In some embodiments, request interface 210 may be omitted and requests may be submitted directly to a queue 220.

Queues 220, in the illustrated embodiment, are first-in first-out (FIFO) queues configured to store requests for arbitration and output requests in the order in which they were received. In the illustrated embodiment, the head of each queue is eligible for arbitration, assuming any other requirements for availability are met (e.g., assuming the queue has sufficient credits to submit requests for arbitration in credit-based implementations).

Snapshot control circuitry 230, in the illustrated embodiment, is configured to generate snapshots at different times, e.g., in response to an event such as a threshold queue status being satisfied. For example, snapshot control circuitry 230 may monitor for events such as a queue meeting a threshold number of valid entries, a threshold number of queues meeting a threshold number of valid entries, the total number of entries in queues 220 meeting a threshold number of valid entries, etc.
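As an illustration only, the example trigger events listed above can be modeled in software. The following Python sketch is not taken from the disclosure: the function name, the parameter names, and the choice to treat the three conditions as alternatives (any one of them triggers a snapshot) are assumptions made for this example, and the threshold values are arbitrary.

```python
from typing import Optional, Sequence


def snapshot_event(depths: Sequence[int],
                   single_queue_limit: Optional[int] = None,
                   multi_queue_limit: Optional[int] = None,
                   multi_queue_count: int = 2,
                   total_limit: Optional[int] = None) -> bool:
    """Return True if any configured threshold condition is met.

    depths[i] is the number of valid entries in queue i. A limit set to
    None disables the corresponding condition (hypothetical convention).
    """
    # Condition 1: a single queue meets a threshold number of valid entries.
    if single_queue_limit is not None:
        if any(d >= single_queue_limit for d in depths):
            return True
    # Condition 2: a threshold number of queues each meet a threshold.
    if multi_queue_limit is not None:
        if sum(d >= multi_queue_limit for d in depths) >= multi_queue_count:
            return True
    # Condition 3: the total number of entries across queues meets a threshold.
    if total_limit is not None and sum(depths) >= total_limit:
        return True
    return False
```

Hardware would likely evaluate such conditions with comparators on per-queue occupancy counters; the sketch only captures the logical conditions.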

In some embodiments, snapshot control circuitry 230 supports generating and tracking multiple snapshots at the same time. In some embodiments, supporting a greater number of snapshots may provide a closer approximation to timestamp-based arbitration but may increase area requirements, e.g., for counters to store data for different snapshots. Therefore, different embodiments may support different numbers of snapshots or a given embodiment may support multiple operating modes with different maximum numbers of snapshots per mode.

Arbitration circuitry 240, in the illustrated embodiment, is configured to select from among available entries in queues 220 to provide selected requests (e.g., to allow those requests to access a memory such as a cache or to access a memory bus). In some embodiments, arbitration circuitry is configured to arbitrate according to one or more default modes of operation prior to beginning snapshot-based arbitration. Note that arbitration circuitry 240 may implement one level of arbitration in a multi-level arbitration scheme. Therefore, requests may proceed through arbitration prior to and after the disclosed arbitration at arbitration circuitry 240. Other levels of arbitration may use similar or different arbitration techniques.

In some embodiments, arbitration circuitry is configured to perform weighted round-robin arbitration among queues prior to initiating snapshot arbitration. In weighted round-robin mode, some queues have a greater weight than other queues. The weights may be based on expected traffic from different channels, for example. In some situations, however, one queue may have an unexpected number of requests or the overall traffic may increase, which may trigger a snapshot event. At this time, snapshot control circuitry 230 may generate snapshot data and signal to arbitration circuitry 240 to begin snapshot-based arbitration.
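For readers unfamiliar with weighted round-robin, the following Python sketch shows one common formulation in which each queue may win up to its weight in consecutive grants before the pointer advances. The class name, the grant-budget formulation, and the weight values are assumptions for illustration; the disclosure does not specify how weights are applied.

```python
from typing import List, Optional, Sequence


class WeightedRoundRobin:
    """Hypothetical weighted round-robin arbiter (weights must be positive)."""

    def __init__(self, weights: Sequence[int]) -> None:
        self.weights: List[int] = list(weights)
        self.current = 0                      # queue currently holding the turn
        self.remaining = self.weights[0]      # grants left in this turn

    def select(self, ready: Sequence[bool]) -> Optional[int]:
        """Return the index of the winning queue, or None if none is ready."""
        n = len(self.weights)
        # n + 1 iterations let the pointer wrap back to the current queue
        # with a fresh budget when it is the only ready queue.
        for _ in range(n + 1):
            if ready[self.current] and self.remaining > 0:
                self.remaining -= 1
                return self.current
            # Turn is over (budget spent or queue not ready): move on.
            self.current = (self.current + 1) % n
            self.remaining = self.weights[self.current]
        return None


arbiter = WeightedRoundRobin([2, 1])
grants = [arbiter.select([True, True]) for _ in range(6)]
print(grants)  # [0, 0, 1, 0, 0, 1]
```

With weights of 2 and 1, queue 0 receives two grants for each grant to queue 1 while both queues have requests pending.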

Snapshot-based arbitration may take various forms in different embodiments or operating modes. For example, strict snapshot arbitration may select all requests from the current oldest snapshot before selecting any requests that are not in this snapshot (although the current snapshot may be overridden in certain circumstances). In other embodiments, snapshot information may be used as one input among others in an arbitration procedure, e.g., to provide a greater priority weight to requests from the current oldest snapshot than to other requests.
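The difference between the strict form and the weighted form can be illustrated with a small Python sketch. The candidate tuple layout, the notion of a numeric base priority, and the additive bonus are all assumptions for this example; the disclosure does not prescribe a scoring function.

```python
from typing import List, Optional, Tuple

# (queue name, head is in oldest snapshot, base priority) -- hypothetical layout
Candidate = Tuple[str, bool, int]


def pick_strict(candidates: List[Candidate]) -> Optional[str]:
    """Strict form: requests in the oldest snapshot always win if present."""
    if not candidates:
        return None
    in_oldest = [c for c in candidates if c[1]]
    pool = in_oldest if in_oldest else candidates
    return max(pool, key=lambda c: c[2])[0]


def pick_weighted(candidates: List[Candidate], snapshot_bonus: int) -> Optional[str]:
    """Weighted form: snapshot membership adds to the priority score."""
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: c[2] + (snapshot_bonus if c[1] else 0))[0]


cands = [("A", True, 1), ("B", False, 7)]
print(pick_strict(cands))        # A
print(pick_weighted(cands, 4))   # B
print(pick_weighted(cands, 8))   # A
```

In this sketch, a small bonus lets a high-priority request outside the snapshot win, while a large bonus makes the weighted form behave like the strict form.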

FIG. 3 is a diagram illustrating example counter-based snapshot data at different times, according to some embodiments. The upper portion of FIG. 3 represents a first point in time and the lower portion of FIG. 3 represents a second, later point in time.

At the first point in time, snapshot 1 has four entries in queue A and two entries in queue N. At the second point in time, another snapshot (snapshot 2) has been taken and snapshot 1 only has one entry in queue A and two entries in queue N. At this point, snapshot 2 has two entries in queue A (including one that is also in snapshot 1) and five entries in queue N (including two that are also in snapshot 1).

In the illustrated embodiment, circuitry for each snapshot may include a counter for each queue that supports counting up to the maximum number of entries in the queue. Note that the disclosed counters are included for purposes of explanation but are not intended to limit the scope of the present disclosure. In other embodiments, any of various encodings may be used to represent snapshots, e.g., with separate counts such that snapshots are non-overlapping (in which case the counts for snapshot 2 in FIG. 3 would be one for queue A and three for queue N), pointer-based tracking of the tail of snapshots, tracking a request identifier or other content of the last request per queue in a snapshot, etc. In various embodiments, snapshot control circuitry 230 may indicate to arbitration circuitry 240 which queues have head entries that are in the current oldest snapshot (or may alternatively identify the snapshot of the head entries of each queue).
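The counter behavior described for FIG. 3 can be modeled with a short Python sketch. The class and method names are hypothetical, and the model assumes overlapping counts in which every dequeue from a FIFO queue decrements that queue's counter in each snapshot that still counts entries for it. The sequence of arrivals and departures between the two points in time is inferred from the counts given in the text.

```python
from typing import Dict, List


class SnapshotCounters:
    """Per-snapshot, per-queue counters with overlapping counts (hypothetical)."""

    def __init__(self, queue_names: List[str]) -> None:
        self.depth: Dict[str, int] = {q: 0 for q in queue_names}
        self.snapshots: List[Dict[str, int]] = []  # oldest first

    def enqueue(self, queue: str) -> None:
        self.depth[queue] += 1

    def take_snapshot(self) -> None:
        # A snapshot records how many requests each queue holds right now.
        self.snapshots.append(dict(self.depth))

    def dequeue(self, queue: str) -> None:
        if self.depth[queue] == 0:
            raise ValueError(f"queue {queue} is empty")
        self.depth[queue] -= 1
        # The FIFO head is the oldest entry, so it belongs to every snapshot
        # that still counts entries for this queue.
        for snap in self.snapshots:
            if snap[queue] > 0:
                snap[queue] -= 1
        # Retire the oldest snapshot once all of its requests are gone.
        while self.snapshots and not any(self.snapshots[0].values()):
            self.snapshots.pop(0)

    def head_in_oldest(self, queue: str) -> bool:
        return bool(self.snapshots) and self.snapshots[0][queue] > 0

    def non_overlapping(self) -> List[Dict[str, int]]:
        # Alternative encoding: each snapshot counts only requests that are
        # not already in an older snapshot.
        result, prev = [], {q: 0 for q in self.depth}
        for snap in self.snapshots:
            result.append({q: snap[q] - prev[q] for q in snap})
            prev = snap
        return result


t = SnapshotCounters(["A", "N"])
for _ in range(4):
    t.enqueue("A")
for _ in range(2):
    t.enqueue("N")
t.take_snapshot()            # first point in time: snapshot 1 = {A: 4, N: 2}
for _ in range(3):
    t.dequeue("A")           # three requests from queue A win arbitration
t.enqueue("A")
for _ in range(3):
    t.enqueue("N")
t.take_snapshot()            # second point in time
print(t.snapshots)           # [{'A': 1, 'N': 2}, {'A': 2, 'N': 5}]
print(t.non_overlapping())   # [{'A': 1, 'N': 2}, {'A': 1, 'N': 3}]
```

The printed values match the counts given for FIG. 3, including the non-overlapping counts of one for queue A and three for queue N in snapshot 2.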

FIG. 4 is a flow diagram illustrating an example method for snapshot-based arbitration, according to some embodiments. At 410, in the illustrated embodiment, arbitration circuitry 240 selects a request using a default arbitration technique. At 420, the system determines whether a snapshot has been triggered. If not, flow proceeds to 410 and arbitration circuitry 240 selects another request using the default technique. If a snapshot has been triggered, flow proceeds to 430. Note that additional snapshots may be triggered while in snapshot-based arbitration mode, based on similar threshold(s) to the initial snapshot or based on different threshold(s) that are specific to situations where a snapshot has already been triggered.

At 430, the system determines whether there is a snapshot deferral scenario. In the illustrated embodiment, if there is no deferral scenario (also referred to herein as an override scenario), arbitration circuitry selects a request within the oldest snapshot, e.g., using round-robin techniques among queues. In some embodiments, snapshot control circuitry 230 provides a signal indicating whether the head of each queue is included in the oldest snapshot and arbitration circuitry 240 may not consider requests that are not in the oldest snapshot. If a snapshot deferral scenario is detected, flow proceeds to 440.

At 440, in the illustrated example, arbitration circuitry 240 selects a request from within the N oldest snapshots (e.g., the oldest two snapshots). The number of snapshots N may be fixed or may be dynamically determined based on current operating conditions. This may allow selection from younger snapshots when the oldest snapshot does not have available requests, e.g., due to all channels with requests for the oldest snapshot waiting for credits from downstream circuitry. (Note that circuitry may implement other deferral techniques and techniques for arbitrating among different snapshots, as discussed in detail below with reference to FIG. 9.)

At 460, in the illustrated embodiment, the system determines whether all outstanding snapshots have been completed. If so, flow proceeds to 410 and default arbitration proceeds. If not, flow proceeds to 430 and snapshot-based arbitration continues. Note that in other embodiments, the snapshot control circuitry 230 may exit snapshot-based arbitration before completing all outstanding snapshots, e.g., when queue entries are below one or more thresholds. In this situation, snapshot control circuitry 230 may clear any remaining snapshot data or may retain the remaining snapshot data in case snapshot-based arbitration is triggered again.
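One selection step of the FIG. 4 flow can be sketched in Python as follows. This is a simplified model under several assumptions: plain round-robin stands in for the default technique, a deferral scenario is modeled as no available head being in the oldest snapshot, and the input encoding (a per-queue snapshot rank and an availability flag) is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def round_robin(eligible: Sequence[bool], start: int) -> Optional[int]:
    """Return the first eligible index at or after start, wrapping around."""
    n = len(eligible)
    for offset in range(n):
        i = (start + offset) % n
        if eligible[i]:
            return i
    return None


def select(head_snapshot: Sequence[Optional[int]],
           available: Sequence[bool],
           outstanding: int,
           start: int = 0,
           n_oldest: int = 2) -> Tuple[Optional[int], str]:
    """Pick one queue and report which branch of the flow was taken.

    head_snapshot[i]: age rank of the oldest snapshot containing the head of
        queue i (0 = oldest), or None if the head is in no snapshot.
    available[i]: queue i has a head request that may be selected now
        (e.g., it has credits).
    outstanding: number of snapshots not yet completed.
    """
    n = len(available)
    if outstanding == 0:
        # 410: default arbitration (plain round-robin as a stand-in).
        return round_robin(list(available), start), "default"
    in_oldest = [available[i] and head_snapshot[i] == 0 for i in range(n)]
    if any(in_oldest):
        # No deferral scenario: select within the oldest snapshot.
        return round_robin(in_oldest, start), "oldest"
    # 440: deferral scenario, widen selection to the N oldest snapshots.
    widened = [available[i] and head_snapshot[i] is not None
               and head_snapshot[i] < n_oldest for i in range(n)]
    return round_robin(widened, start), "deferral"


print(select([0, 1], [True, True], outstanding=2, start=1))   # (0, 'oldest')
print(select([0, 1], [False, True], outstanding=2))           # (1, 'deferral')
```

The caller would repeat this step, update snapshot state after each grant, and return to default arbitration once no snapshots remain outstanding (the check at 460).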

Note that in some embodiments, certain channels may have a fixed priority or be otherwise prioritized. Therefore, arbitration circuitry 240 may select from these queues even when the requests from the queues are not in the oldest snapshot during snapshot-based arbitration, in some scenarios. Said another way, snapshot-based arbitration may be used for only a subset of queues while other queues may use other arbitration schemes. Generally, snapshot information may be used as an arbitration input in combination with various other appropriate arbitration data.

Example Split Cache Architecture with Routing Circuitry that Implements Snapshots

FIG. 5 is a block diagram illustrating an example split cache architecture, according to some embodiments. In the illustrated embodiment, a device includes multiple processor sub-units 510A-510N (which may be GPU instances connected to form a larger graphics processor, for example), multiple interfaces 515A-515N, multiple cache subsets 520A-520N, and multiple routers 525A-525N.

The processor sub-units 510 may be included on the same die or on different dies and may communicate via one or more communication fabrics (which may include separate networks for workload management control signals and data, for example). The processor sub-units may include various agents that access a memory hierarchy that includes cache subsets 520.

Cache subsets 520, in some embodiments, store different subsets of a given cache level (e.g., an L2 cache, L1 cache, etc.). The cache implemented by subsets 520 may be an instruction cache, a data cache, or both. In some embodiments, each cache subset implements a subset of a cache memory space and the subsets do not overlap. In other embodiments, processor sub-units 510 may have corresponding memory sub-sets of a memory space, which may not be a cache space. Having a split cache architecture may be advantageous in distributed GPU architectures, for example.

Interfaces 515, in some embodiments, are configured to receive memory transaction requests from a corresponding processor sub-unit 510 or from a router 525. For transaction requests with addresses in the corresponding cache subset 520, the interface may submit the requests to the cache subset. For requests with addresses in a different cache subset, an interface 515 forwards the transaction requests to a router 525 for routing to the appropriate cache subset 520. In some embodiments, interfaces 515 implement a portion of a credit interface and may track credits for different channels.
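The local-versus-remote decision made by an interface 515 can be sketched as follows. The address-to-subset mapping shown here (interleaving cache lines across subsets) is purely an assumption for illustration, as are the function names and the default line size and subset count; the disclosure states only that subsets may be non-overlapping.

```python
from typing import Tuple


def owning_subset(addr: int, num_subsets: int = 4, line_bytes: int = 64) -> int:
    """Map an address to a cache subset by interleaving cache lines (assumed)."""
    return (addr // line_bytes) % num_subsets


def dispatch(addr: int, local_subset: int,
             num_subsets: int = 4, line_bytes: int = 64) -> Tuple[str, int]:
    """Return ("cache", subset) for local requests, ("router", subset) otherwise."""
    owner = owning_subset(addr, num_subsets, line_bytes)
    if owner == local_subset:
        return ("cache", owner)     # submit directly to the local cache subset
    return ("router", owner)        # forward to the router for the owning subset


print(dispatch(0x80, local_subset=2))   # ('cache', 2)
print(dispatch(0xC0, local_subset=2))   # ('router', 3)
```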

Routers 525, in some embodiments, are configured to route transaction requests in multiple directions. For example, for a system with N cache subsets, routers 525 may be configured to route transactions to N−1 other routers. In some embodiments, each router 525 implements an instance of arbitration circuitry 240. Disclosed arbitration techniques may provide snapshot arbitration to approximate time-based arbitration when queue thresholds are reached. FIGS. 6-8 provide a more detailed illustration of example router 525 embodiments.

FIG. 6 is a block diagram illustrating a router 525 that includes multiple input stages 610 and multiple output stages 620. In the illustrated embodiment, each input stage includes a bus to send request data to each output stage. For example, in an embodiment with four input stages and four output stages, each input stage may provide four output buses and each output stage may receive four input buses.

In some embodiments, each input stage is configured to route request data to the appropriate output stage, e.g., based on a portion of the address associated with the request. In the illustrated embodiment, input stages provide credit data and output stages receive credit data. Note that credit tracking is one example technique to provide quality of service among multiple channels, but is not intended to limit the scope of the present disclosure; other techniques are contemplated.

FIG. 7 is a block diagram illustrating an example input stage for a router, according to some embodiments. In the illustrated embodiment, the input stage includes credit control circuitry 710, command buffer 720, data buffer 730, virtual channel (VC) queues 740A-740N, and regroup by direction circuitry 750. In some embodiments, some incoming transaction requests are split into command portions stored in command buffer 720 and data portions stored in data buffer 730. Requests are tracked in a corresponding in-order VC queue 740, e.g., depending on the type of request. Note that multiple queues 740 may be implemented for a given virtual channel. In some embodiments, separate queues are maintained for writes and reads. The queues may avoid blocking between channels while maintaining packet ordering for a given channel. Credit control circuitry 710 may forward credit information from downstream circuitry to requesting agents.

Regroup by direction circuitry 750, in the illustrated embodiment, is configured to send packets to the appropriate output stage for the proper router direction, e.g., based on a portion of an address of the transaction. For example, different transactions within the same channel may be routed to different directions based on their addresses.

FIG. 8 is a block diagram illustrating an example output stage for a router that includes arbitration circuitry, according to some embodiments. In the illustrated embodiment, the output stage receives internal router data from multiple input stages. The output stage, in this example, includes regroup by channel circuitry 810, arbitration circuitry 820, and credit control circuitry 830.

Regroup by channel circuitry 810, in some embodiments, is configured to regroup router data from multiple directions by queue. Therefore, each channel may present one or more input packets to arbitration circuitry 820 from the VC queue(s) 740 in one or more input stages.

Arbitration circuitry 820 may arbitrate among queues, among channels, or both, e.g., according to the techniques discussed above with reference to arbitration circuitry 240. In some embodiments, only inputs with available credits (e.g., as indicated by credit control 830) are eligible for arbitration. The arbitration circuitry 820 outputs selected requests, e.g., via an interface 515 to a cache subset 520. In some embodiments, arbitration circuitry 820 first arbitrates among virtual channels and then arbitrates among queues for a given virtual channel.
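A credit-gated, two-level selection of the kind described above might look like the following Python sketch. Tracking credits per virtual channel, using round-robin at both levels, and the function and parameter names are assumptions for this example rather than details from the disclosure.

```python
from typing import Optional, Sequence, Tuple


def two_level_select(ready: Sequence[Sequence[bool]],
                     credits: Sequence[int],
                     vc_start: int = 0,
                     q_start: int = 0) -> Optional[Tuple[int, int]]:
    """Pick (virtual channel, queue) or None if nothing is eligible.

    ready[vc][q]: queue q of virtual channel vc has a request at its head.
    credits[vc]: credits currently available to virtual channel vc (assumed
        to be tracked per channel).
    """
    num_vcs = len(ready)
    # Level 1: round-robin among virtual channels that have credits and
    # at least one ready queue.
    for vc_offset in range(num_vcs):
        vc = (vc_start + vc_offset) % num_vcs
        if credits[vc] <= 0 or not any(ready[vc]):
            continue
        # Level 2: round-robin among the queues of the chosen channel.
        num_queues = len(ready[vc])
        for q_offset in range(num_queues):
            q = (q_start + q_offset) % num_queues
            if ready[vc][q]:
                return vc, q
    return None


ready = [[True, False], [False, True]]
print(two_level_select(ready, credits=[0, 1]))   # (1, 1)
print(two_level_select(ready, credits=[1, 1]))   # (0, 0)
```

In the first call, virtual channel 0 has a ready request but no credits, so it is skipped and the request from virtual channel 1 is selected.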

FIG. 9 is a block diagram illustrating example snapshot override circuitry, according to some embodiments. In the illustrated embodiment, circuitry 910 receives impediment information and information indicating remaining queue entries for the current snapshot and outputs an override signal to arbitration circuitry. In some embodiments, circuitry 910 is configured to allow selection of requests from one or more other snapshots that are not included in the oldest snapshot based on the impediment information. For example, the impediment information may indicate that all requests for the oldest snapshot are waiting for credits and therefore unavailable for arbitration.
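The credit-starvation example above can be expressed as a small Boolean function. This Python sketch is an assumption about one possible override condition (assert the override when every queue with entries remaining in the oldest snapshot is impeded); the function name and input encoding are hypothetical.

```python
from typing import Sequence


def override_signal(remaining: Sequence[int], blocked: Sequence[bool]) -> bool:
    """Return True when the oldest snapshot should be overridden.

    remaining[q]: entries of queue q still in the oldest snapshot.
    blocked[q]: queue q is impeded (e.g., waiting for credits).
    """
    pending = [q for q, count in enumerate(remaining) if count > 0]
    # Override only if the snapshot still has requests and all of them
    # sit in queues that cannot currently be selected.
    return bool(pending) and all(blocked[q] for q in pending)


print(override_signal([2, 0, 1], [True, False, True]))   # True
print(override_signal([2, 0, 1], [True, True, False]))   # False
```

In the second call, queue 2 still has an unimpeded request in the oldest snapshot, so no override is signaled.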

Speaking generally, multiple types of impediments may be used to override snapshot arbitration, and snapshot arbitration may not require fixed selection of the oldest snapshot before any other snapshots. For example, snapshots may be used to weight requests as an input to an arbitration algorithm, but the arbitration circuitry 240 may select requests based on various other arbitration inputs as well.

Example Method

FIG. 10 is a flow diagram illustrating an example method for snapshot-based arbitration, according to some embodiments. The method shown in FIG. 10 may be used in conjunction with any of the computer circuitry, systems, devices, elements, or components disclosed herein, among others. In various embodiments, some of the method elements shown may be performed concurrently, in a different order than shown, or may be omitted. Additional method elements may also be performed as desired.

At 1010, in the illustrated embodiment, control circuitry generates a first snapshot that indicates numbers of requests in respective queues of multiple queues at a first time. The generating is performed in response to detecting a first threshold condition associated with queue circuitry that includes the multiple queues. One or more of the queues may maintain ordering among requests of one or more virtual channels that are shared by multiple client circuits.

At 1020, in the illustrated embodiment, control circuitry generates a second snapshot that indicates numbers of requests in respective queues of the multiple queues at a second time that is subsequent to the first time. Thus, in some embodiments, the control circuitry supports tracking multiple snapshots at the same time.

At 1030, in the illustrated embodiment, control circuitry arbitrates between requests from the multiple queues to select requests to access the memory bus, wherein the arbitrating is based on snapshots to which requests from the multiple queues belong.

In some embodiments, the control circuitry is configured to select all available requests from the first snapshot before selecting any requests from the second snapshot that are not in the first snapshot. As one example, the arbitration circuitry may select one or more requests from the second snapshot, that are not in the first snapshot, while one or more requests from the first snapshot are still queued, e.g., if the one or more requests from the first snapshot are not available. Requests from the first snapshot may not be available for various reasons. As one example, requests from the first snapshot may be unavailable due to lack of credits.

In some embodiments, the control circuitry is configured to provide a greater priority weight to requests from the first snapshot than requests from the second snapshot that are not in the first snapshot. This may result in requests from the first snapshot being selected preferentially over requests from the second snapshot.

In some embodiments, the control circuitry is configured to select all requests from the first snapshot before selecting any requests from the second snapshot that are not in the first snapshot.

In some embodiments, the control circuitry is configured to use weighted round-robin arbitration among the queues prior to detecting the first threshold condition and round-robin arbitration among queues for requests in the first snapshot subsequent to detecting the first threshold condition.

In some embodiments, the control circuitry includes at least three sets of counters configured to maintain information that indicates current numbers of requests in three or more snapshots for respective ones of the queues. In some embodiments, the control circuitry is configured to update the counters based on queued requests winning arbitration.

In some embodiments, the control circuitry is included in a split cache architecture in which a cache includes a first portion associated with a first processor sub-unit and a second portion associated with a second processor sub-unit. In some embodiments, the device includes router circuitry configured to route cache access requests from the second processor sub-unit to the first portion of the cache. The queue circuitry and control circuitry may be included in the router circuitry. In some embodiments, the cache includes at least three portions associated with respective processor sub-units and the router circuitry is configured to route cache access requests in at least three directions.

In some embodiments, the queue circuitry and control circuitry are included in circuitry that routes memory access requests for one or more shader cores of a graphics processor.

Example Device

Referring now to FIG. 11, a block diagram illustrating an example embodiment of a device 1100 is shown. In some embodiments, elements of device 1100 may be included within a system on a chip. In some embodiments, device 1100 may be included in a mobile device, which may be battery-powered. Therefore, power consumption by device 1100 may be an important design consideration. In the illustrated embodiment, device 1100 includes fabric 1110, compute complex 1120, input/output (I/O) bridge 1150, cache/memory controller 1145, graphics unit 1175, and display unit 1165. In some embodiments, device 1100 may include other components (not shown) in addition to or in place of the illustrated components, such as video processor encoders and decoders, image processing or recognition elements, computer vision elements, etc.

Fabric 1110 may include various interconnects, buses, MUXes, controllers, etc., and may be configured to facilitate communication between various elements of device 1100. In some embodiments, portions of fabric 1110 may be configured to implement various different communication protocols. In other embodiments, fabric 1110 may implement a single communication protocol and elements coupled to fabric 1110 may convert from the single communication protocol to other communication protocols internally.

The disclosed arbitration circuitry that supports snapshotting may be utilized at one or more of various locations within device 1100, including, without limitation, graphics unit 1175, fabric 1110, compute complex 1120, cache/memory controller 1145, etc.

In the illustrated embodiment, compute complex 1120 includes bus interface unit (BIU) 1125, cache 1130, and cores 1135 and 1140. In various embodiments, compute complex 1120 may include various numbers of processors, processor cores and caches. For example, compute complex 1120 may include 1, 2, or 4 processor cores, or any other suitable number. In one embodiment, cache 1130 is a set associative L2 cache. In some embodiments, cores 1135 and 1140 may include internal instruction and data caches. In some embodiments, a coherency unit (not shown) in fabric 1110, cache 1130, or elsewhere in device 1100 may be configured to maintain coherency between various caches of device 1100. BIU 1125 may be configured to manage communication between compute complex 1120 and other elements of device 1100. Processor cores such as cores 1135 and 1140 may be configured to execute instructions of a particular instruction set architecture (ISA) which may include operating system instructions and user application instructions.

Cache/memory controller 1145 may be configured to manage transfer of data between fabric 1110 and one or more caches and memories. For example, cache/memory controller 1145 may be coupled to an L3 cache, which may in turn be coupled to a system memory. In other embodiments, cache/memory controller 1145 may be directly coupled to a memory. In some embodiments, cache/memory controller 1145 may include one or more internal caches.

As used herein, the term “coupled to” may indicate one or more connections between elements, and a coupling may include intervening elements. For example, in FIG. 11, graphics unit 1175 may be described as “coupled to” a memory through fabric 1110 and cache/memory controller 1145. In contrast, in the illustrated embodiment of FIG. 11, graphics unit 1175 is “directly coupled” to fabric 1110 because there are no intervening elements.
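
The distinction between “coupled to” and “directly coupled” can be illustrated with a small Python sketch that treats elements as nodes of a graph. The adjacency table and the function names (DIRECT, directly_coupled, coupled) are hypothetical, and the table reflects only the connections named in the preceding paragraph, with the assumption that cache/memory controller 1145 connects directly to a memory, as in some embodiments.

```python
from collections import deque

# Direct connections only (assumed subset of FIG. 11 for illustration).
DIRECT = {
    "graphics unit 1175": {"fabric 1110"},
    "fabric 1110": {"graphics unit 1175", "cache/memory controller 1145"},
    "cache/memory controller 1145": {"fabric 1110", "memory"},
    "memory": {"cache/memory controller 1145"},
}


def directly_coupled(a, b):
    """True when a and b are connected with no intervening elements."""
    return b in DIRECT.get(a, set())


def coupled(a, b):
    """True when b is reachable from a, possibly through intervening elements."""
    seen = {a}
    frontier = deque([a])
    while frontier:
        node = frontier.popleft()
        for neighbor in DIRECT.get(node, ()):
            if neighbor == b:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return False


print(directly_coupled("graphics unit 1175", "fabric 1110"))
print(directly_coupled("graphics unit 1175", "memory"))
print(coupled("graphics unit 1175", "memory"))
```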

Graphics unit 1175 may include one or more processors, e.g., one or more graphics processing units (GPUs). Graphics unit 1175 may receive graphics-oriented instructions, such as OPENGL®, Metal, or DIRECT3D® instructions, for example. Graphics unit 1175 may execute specialized GPU instructions or perform other operations based on the received graphics-oriented instructions. Graphics unit 1175 may generally be configured to process large blocks of data in parallel and may build images in a frame buffer for output to a display, which may be included in the device or may be a separate device. Graphics unit 1175 may include transform, lighting, triangle, and rendering engines in one or more graphics processing pipelines. Graphics unit 1175 may output pixel information for display images. Graphics unit 1175, in various embodiments, may include programmable shader circuitry which may include highly parallel execution cores configured to execute graphics programs, which may include pixel tasks, vertex tasks, and compute tasks (which may or may not be graphics-related).

Display unit 1165 may be configured to read data from a frame buffer and provide a stream of pixel values for display. Display unit 1165 may be configured as a display pipeline in some embodiments. Additionally, display unit 1165 may be configured to blend multiple frames to produce an output frame. Further, display unit 1165 may include one or more interfaces (e.g., MIPI® or embedded display port (eDP)) for coupling to a user display (e.g., a touchscreen or an external display).

I/O bridge 1150 may include various elements configured to implement: universal serial bus (USB) communications, security, audio, and low-power always-on functionality, for example. I/O bridge 1150 may also include interfaces such as pulse-width modulation (PWM), general-purpose input/output (GPIO), serial peripheral interface (SPI), and inter-integrated circuit (I2C), for example. Various types of peripherals and devices may be coupled to device 1100 via I/O bridge 1150.

In some embodiments, device 1100 includes network interface circuitry (not explicitly shown), which may be connected to fabric 1110 or I/O bridge 1150. The network interface circuitry may be configured to communicate via various networks, which may be wired, wireless, or both. For example, the network interface circuitry may be configured to communicate via a wired local area network, a wireless local area network (e.g., via WiFi), or a wide area network (e.g., the Internet or a virtual private network). In some embodiments, the network interface circuitry is configured to communicate via one or more cellular networks that use one or more radio access technologies. In some embodiments, the network interface circuitry is configured to communicate using device-to-device communications (e.g., Bluetooth or WiFi Direct), etc. In various embodiments, the network interface circuitry may provide device 1100 with connectivity to various types of other devices and networks.

Example Applications

Turning now to FIG. 12, various types of systems that may include any of the circuits, devices, or systems discussed above are shown. System or device 1200, which may incorporate or otherwise utilize one or more of the techniques described herein, may be utilized in a wide range of areas. For example, system or device 1200 may be utilized as part of the hardware of systems such as a desktop computer 1210, laptop computer 1220, tablet computer 1230, cellular or mobile phone 1240, or television 1250 (or set-top box coupled to a television).

Similarly, disclosed elements may be utilized in a wearable device 1260, such as a smartwatch or a health-monitoring device. Smartwatches, in many embodiments, may implement a variety of different functions (for example, access to email, cellular service, calendar, health monitoring, etc.). A wearable device may also be designed solely to perform health-monitoring functions, such as monitoring a user's vital signs, performing epidemiological functions such as contact tracing, providing communication to an emergency medical service, etc. Other types of devices are also contemplated, including devices worn on the neck, devices implantable in the human body, glasses or a helmet designed to provide computer-generated reality experiences such as those based on augmented and/or virtual reality, etc.

System or device 1200 may also be used in various other contexts. For example, system or device 1200 may be utilized in the context of a server computer system, such as a dedicated server or on shared hardware that implements a cloud-based service 1270. Still further, system or device 1200 may be implemented in a wide range of specialized everyday devices, including devices 1280 commonly found in the home such as refrigerators, thermostats, security cameras, etc. The interconnection of such devices is often referred to as the “Internet of Things” (IoT). Elements may also be implemented in various modes of transportation. For example, system or device 1200 could be employed in the control systems, guidance systems, entertainment systems, etc. of various types of vehicles 1290.

The applications illustrated in FIG. 12 are merely exemplary and are not intended to limit the potential future applications of disclosed systems or devices. Other example applications include, without limitation: portable gaming devices, music players, data storage devices, unmanned aerial vehicles, etc.

Example Computer-Readable Medium

The present disclosure has described various example circuits in detail above. It is intended that the present disclosure cover not only embodiments that include such circuitry, but also a computer-readable storage medium that includes design information that specifies such circuitry. Accordingly, the present disclosure is intended to support claims that cover not only an apparatus that includes the disclosed circuitry, but also a storage medium that specifies the circuitry in a format that is recognized by a fabrication system configured to produce hardware (e.g., an integrated circuit) that includes the disclosed circuitry. Claims to such a storage medium are intended to cover, for example, an entity that produces a circuit design, but does not itself fabricate the design.

FIG. 13 is a block diagram illustrating an example non-transitory computer-readable storage medium that stores circuit design information, according to some embodiments. In the illustrated embodiment, semiconductor fabrication system 1320 is configured to process the design information 1315 stored on non-transitory computer-readable medium 1310 and fabricate integrated circuit 1330 based on the design information 1315.

Non-transitory computer-readable storage medium 1310 may comprise any of various appropriate types of memory devices or storage devices. Non-transitory computer-readable storage medium 1310 may be an installation medium, e.g., a CD-ROM, floppy disks, or tape device; a computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; a non-volatile memory such as a Flash, magnetic media, e.g., a hard drive, or optical storage; registers, or other similar types of memory elements, etc. Non-transitory computer-readable storage medium 1310 may include other types of non-transitory memory as well or combinations thereof. Non-transitory computer-readable storage medium 1310 may include two or more memory mediums which may reside in different locations, e.g., in different computer systems that are connected over a network.

Design information 1315 may be specified using any of various appropriate computer languages, including hardware description languages such as, without limitation: VHDL, Verilog, SystemC, SystemVerilog, RHDL, M, MyHDL, etc. Design information 1315 may be usable by semiconductor fabrication system 1320 to fabricate at least a portion of integrated circuit 1330. The format of design information 1315 may be recognized by at least one semiconductor fabrication system 1320. In some embodiments, design information 1315 may also include one or more cell libraries which specify the synthesis, layout, or both of integrated circuit 1330. In some embodiments, the design information is specified in whole or in part in the form of a netlist that specifies cell library elements and their connectivity. Design information 1315, taken alone, may or may not include sufficient information for fabrication of a corresponding integrated circuit. For example, design information 1315 may specify the circuit elements to be fabricated but not their physical layout. In this case, design information 1315 may need to be combined with layout information to actually fabricate the specified circuitry.

Integrated circuit 1330 may, in various embodiments, include one or more custom macrocells, such as memories, analog or mixed-signal circuits, and the like. In such cases, design information 1315 may include information related to included macrocells. Such information may include, without limitation, schematics capture database, mask design data, behavioral models, and device or transistor level netlists. As used herein, mask design data may be formatted according to graphic data system (GDSII), or any other suitable format.

Semiconductor fabrication system 1320 may include any of various appropriate elements configured to fabricate integrated circuits. This may include, for example, elements for depositing semiconductor materials (e.g., on a wafer, which may include masking), removing materials, altering the shape of deposited materials, modifying materials (e.g., by doping materials or modifying dielectric constants using ultraviolet processing), etc. Semiconductor fabrication system 1320 may also be configured to perform various testing of fabricated circuits for correct operation.

In various embodiments, integrated circuit 1330 is configured to operate according to a circuit design specified by design information 1315, which may include performing any of the functionality described herein. For example, integrated circuit 1330 may include any of various elements shown in FIGS. 1, 2, 5-9, and 11. Further, integrated circuit 1330 may be configured to perform various functions described herein in conjunction with other components. Further, the functionality described herein may be performed by multiple connected integrated circuits.

As used herein, a phrase of the form “design information that specifies a design of a circuit configured to . . . ” does not imply that the circuit in question must be fabricated in order for the element to be met. Rather, this phrase indicates that the design information describes a circuit that, upon being fabricated, will be configured to perform the indicated actions or will include the specified components.

The present disclosure includes references to “an embodiment” or groups of “embodiments” (e.g., “some embodiments” or “various embodiments”). Embodiments are different implementations or instances of the disclosed concepts. References to “an embodiment,” “one embodiment,” “a particular embodiment,” and the like do not necessarily refer to the same embodiment. A large number of possible embodiments are contemplated, including those specifically disclosed, as well as modifications or alternatives that fall within the spirit or scope of the disclosure.

This disclosure may discuss potential advantages that may arise from the disclosed embodiments. Not all implementations of these embodiments will necessarily manifest any or all of the potential advantages. Whether an advantage is realized for a particular implementation depends on many factors, some of which are outside the scope of this disclosure. In fact, there are a number of reasons why an implementation that falls within the scope of the claims might not exhibit some or all of any disclosed advantages. For example, a particular implementation might include other circuitry outside the scope of the disclosure that, in conjunction with one of the disclosed embodiments, negates or diminishes one or more of the disclosed advantages. Furthermore, suboptimal design execution of a particular implementation (e.g., implementation techniques or tools) could also negate or diminish disclosed advantages. Even assuming a skilled implementation, realization of advantages may still depend upon other factors such as the environmental circumstances in which the implementation is deployed. For example, inputs supplied to a particular implementation may prevent one or more problems addressed in this disclosure from arising on a particular occasion, with the result that the benefit of its solution may not be realized. Given the existence of possible factors external to this disclosure, it is expressly intended that any potential advantages described herein are not to be construed as claim limitations that must be met to demonstrate infringement. Rather, identification of such potential advantages is intended to illustrate the type(s) of improvement available to designers having the benefit of this disclosure. That such advantages are described permissively (e.g., stating that a particular advantage “may arise”) is not intended to convey doubt about whether such advantages can in fact be realized, but rather to recognize the technical reality that realization of such advantages often depends on additional factors.

Unless stated otherwise, embodiments are non-limiting. That is, the disclosed embodiments are not intended to limit the scope of claims that are drafted based on this disclosure, even where only a single example is described with respect to a particular feature. The disclosed embodiments are intended to be illustrative rather than restrictive, absent any statements in the disclosure to the contrary. The application is thus intended to permit claims covering disclosed embodiments, as well as such alternatives, modifications, and equivalents that would be apparent to a person skilled in the art having the benefit of this disclosure.

For example, features in this application may be combined in any suitable manner. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority thereto) to any such combination of features. In particular, with reference to the appended claims, features from dependent claims may be combined with those of other dependent claims where appropriate, including claims that depend from other independent claims. Similarly, features from respective independent claims may be combined where appropriate.

Accordingly, while the appended dependent claims may be drafted such that each depends on a single other claim, additional dependencies are also contemplated. Any combinations of features in the dependent claims that are consistent with this disclosure are contemplated and may be claimed in this or another application. In short, combinations are not limited to those specifically enumerated in the appended claims.

Where appropriate, it is also contemplated that claims drafted in one format or statutory type (e.g., apparatus) are intended to support corresponding claims of another format or statutory type (e.g., method).

Because this disclosure is a legal document, various terms and phrases may be subject to administrative and judicial interpretation. Public notice is hereby given that the following paragraphs, as well as definitions provided throughout the disclosure, are to be used in determining how to interpret claims that are drafted based on this disclosure.

References to a singular form of an item (i.e., a noun or noun phrase preceded by “a,” “an,” or “the”) are, unless context clearly dictates otherwise, intended to mean “one or more.” Reference to “an item” in a claim thus does not, without accompanying context, preclude additional instances of the item. A “plurality” of items refers to a set of two or more of the items.

The word “may” is used herein in a permissive sense (i.e., having the potential to, being able to) and not in a mandatory sense (i.e., must).

The terms “comprising” and “including,” and forms thereof, are open-ended and mean “including, but not limited to.”

When the term “or” is used in this disclosure with respect to a list of options, it will generally be understood to be used in the inclusive sense unless the context provides otherwise. Thus, a recitation of “x or y” is equivalent to “x or y, or both,” and thus covers 1) x but not y, 2) y but not x, and 3) both x and y. On the other hand, a phrase such as “either x or y, but not both” makes clear that “or” is being used in the exclusive sense.

A recitation of “w, x, y, or z, or any combination thereof” or “at least one of . . . w, x, y, and z” is intended to cover all possibilities involving a single element up to the total number of elements in the set. For example, given the set [w, x, y, z], these phrasings cover any single element of the set (e.g., w but not x, y, or z), any two elements (e.g., w and x, but not y or z), any three elements (e.g., w, x, and y, but not z), and all four elements. The phrase “at least one of . . . w, x, y, and z” thus refers to at least one element of the set [w, x, y, z], thereby covering all possible combinations in this list of elements. This phrase is not to be interpreted to require that there is at least one instance of w, at least one instance of x, at least one instance of y, and at least one instance of z.

Various “labels” may precede nouns or noun phrases in this disclosure. Unless context provides otherwise, different labels used for a feature (e.g., “first circuit,” “second circuit,” “particular circuit,” “given circuit,” etc.) refer to different instances of the feature. Additionally, the labels “first,” “second,” and “third” when applied to a feature do not imply any type of ordering (e.g., spatial, temporal, logical, etc.), unless stated otherwise.

The phrase “based on” is used to describe one or more factors that affect a determination. This term does not foreclose the possibility that additional factors may affect the determination. That is, a determination may be solely based on specified factors or based on the specified factors as well as other, unspecified factors. Consider the phrase “determine A based on B.” This phrase specifies that B is a factor that is used to determine A or that affects the determination of A. This phrase does not foreclose that the determination of A may also be based on some other factor, such as C. This phrase is also intended to cover an embodiment in which A is determined based solely on B. As used herein, the phrase “based on” is synonymous with the phrase “based at least in part on.”

The phrases “in response to” and “responsive to” describe one or more factors that trigger an effect. This phrase does not foreclose the possibility that additional factors may affect or otherwise trigger the effect, either jointly with the specified factors or independent from the specified factors. That is, an effect may be solely in response to those factors, or may be in response to the specified factors as well as other, unspecified factors. Consider the phrase “perform A in response to B.” This phrase specifies that B is a factor that triggers the performance of A, or that triggers a particular result for A. This phrase does not foreclose that performing A may also be in response to some other factor, such as C. This phrase also does not foreclose that performing A may be jointly in response to B and C. This phrase is also intended to cover an embodiment in which A is performed solely in response to B. As used herein, the phrase “responsive to” is synonymous with the phrase “responsive at least in part to.” Similarly, the phrase “in response to” is synonymous with the phrase “at least in part in response to.”

Within this disclosure, different entities (which may variously be referred to as “units,” “circuits,” other components, etc.) may be described or claimed as “configured” to perform one or more tasks or operations. This formulation, [entity] configured to [perform one or more tasks], is used herein to refer to structure (i.e., something physical). More specifically, this formulation is used to indicate that this structure is arranged to perform the one or more tasks during operation. A structure can be said to be “configured to” perform some task even if the structure is not currently being operated. Thus, an entity described or recited as being “configured to” perform some task refers to something physical, such as a device, circuit, a system having a processor unit and a memory storing program instructions executable to implement the task, etc. This phrase is not used herein to refer to something intangible.

In some cases, various units/circuits/components may be described herein as performing a set of tasks or operations. It is understood that those entities are “configured to” perform those tasks/operations, even if not specifically noted.

The term “configured to” is not intended to mean “configurable to.” An unprogrammed FPGA, for example, would not be considered to be “configured to” perform a particular function. This unprogrammed FPGA may be “configurable to” perform that function, however. After appropriate programming, the FPGA may then be said to be “configured to” perform the particular function.

For purposes of United States patent applications based on this disclosure, reciting in a claim that a structure is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) for that claim element. Should Applicant wish to invoke Section 112(f) during prosecution of a United States patent application based on this disclosure, it will recite claim elements using the “means for” [performing a function] construct.

Different “circuits” may be described in this disclosure. These circuits or “circuitry” constitute hardware that includes various types of circuit elements, such as combinatorial logic, clocked storage devices (e.g., flip-flops, registers, latches, etc.), finite state machines, memory (e.g., random-access memory, embedded dynamic random-access memory), programmable logic arrays, and so on. Circuitry may be custom designed, or taken from standard libraries. In various implementations, circuitry can, as appropriate, include digital components, analog components, or a combination of both. Certain types of circuits may be commonly referred to as “units” (e.g., a decode unit, an arithmetic logic unit (ALU), functional unit, memory management unit (MMU), etc.). Such units also refer to circuits or circuitry.

The disclosed circuits/units/components and other elements illustrated in the drawings and described herein thus include hardware elements such as those described in the preceding paragraph. In many instances, the internal arrangement of hardware elements within a particular circuit may be specified by describing the function of that circuit. For example, a particular “decode unit” may be described as performing the function of “processing an opcode of an instruction and routing that instruction to one or more of a plurality of functional units,” which means that the decode unit is “configured to” perform this function. This specification of function is sufficient, to those skilled in the computer arts, to connote a set of possible structures for the circuit.

In various embodiments, as discussed in the preceding paragraph, circuits, units, and other elements may be defined by the functions or operations that they are configured to implement. The arrangement of such circuits/units/components with respect to each other and the manner in which they interact form a microarchitectural definition of the hardware that is ultimately manufactured in an integrated circuit or programmed into an FPGA to form a physical implementation of the microarchitectural definition. Thus, the microarchitectural definition is recognized by those of skill in the art as structure from which many physical implementations may be derived, all of which fall into the broader structure described by the microarchitectural definition. That is, a skilled artisan presented with the microarchitectural definition supplied in accordance with this disclosure may, without undue experimentation and with the application of ordinary skill, implement the structure by coding the description of the circuits/units/components in a hardware description language (HDL) such as Verilog or VHDL. The HDL description is often expressed in a fashion that may appear to be functional. But to those of skill in the art in this field, this HDL description is the manner that is used to transform the structure of a circuit, unit, or component to the next level of implementational detail. Such an HDL description may take the form of behavioral code (which is typically not synthesizable), register transfer language (RTL) code (which, in contrast to behavioral code, is typically synthesizable), or structural code (e.g., a netlist specifying logic gates and their connectivity). The HDL description may subsequently be synthesized against a library of cells designed for a given integrated circuit fabrication technology, and may be modified for timing, power, and other reasons to result in a final design database that is transmitted to a foundry to generate masks and ultimately produce the integrated circuit. Some hardware circuits or portions thereof may also be custom-designed in a schematic editor and captured into the integrated circuit design along with synthesized circuitry. The integrated circuits may include transistors and other circuit elements (e.g., passive elements such as capacitors, resistors, inductors, etc.) and interconnect between the transistors and circuit elements. Some embodiments may implement multiple integrated circuits coupled together to implement the hardware circuits, and/or discrete elements may be used in some embodiments. Alternatively, the HDL design may be synthesized to a programmable logic array such as a field programmable gate array (FPGA) and may be implemented in the FPGA. This decoupling between the design of a group of circuits and the subsequent low-level implementation of these circuits commonly results in the scenario in which the circuit or logic designer never specifies a particular set of structures for the low-level implementation beyond a description of what the circuit is configured to do, as this process is performed at a different stage of the circuit implementation process.

The fact that many different low-level combinations of circuit elements may be used to implement the same specification of a circuit results in a large number of equivalent structures for that circuit. As noted, these low-level circuit implementations may vary according to changes in the fabrication technology, the foundry selected to manufacture the integrated circuit, the library of cells provided for a particular project, etc. In many cases, the choices made by different design tools or methodologies to produce these different implementations may be arbitrary.

Moreover, it is common for a single implementation of a particular functional specification of a circuit to include, for a given embodiment, a large number of devices (e.g., millions of transistors). Accordingly, the sheer volume of this information makes it impractical to provide a full recitation of the low-level structure used to implement a single embodiment, let alone the vast array of equivalent possible implementations. For this reason, the present disclosure describes structure of circuits using the functional shorthand commonly employed in the industry.

What is claimed is:
 1. An apparatus, comprising: queue circuitry thatimplements multiple queues configured to queue requests to access amemory bus; control circuitry configured to: in response to detecting afirst threshold condition associated with the queue circuitry, generatea first snapshot that indicates numbers of requests in respective queuesof the multiple queues at a first time; generate a second snapshot thatindicates numbers of requests in respective queues of the multiplequeues at a second time that is subsequent to the first time; andarbitrate between requests from the multiple queues to select requeststo access the memory bus, wherein the arbitration is based on snapshotsto which requests from the multiple queues belong.
 2. The apparatus ofclaim 1, wherein the control circuitry is configured to select allavailable requests from the first snapshot before selecting any requestsfrom the second snapshot that are not in the first snapshot.
 3. Theapparatus of claim 2, wherein the control circuitry implements a creditsystem; and wherein the control circuitry is configured to select one ormore requests from the second snapshot, that are not in the firstsnapshot, while one or more requests from the first snapshot are stillqueued based on one or more requests from the first snapshot beingunavailable due to lack of credits.
 4. The apparatus of claim 1, whereinthe control circuitry is configured to provide a greater priority weightto requests from the first snapshot than requests from the secondsnapshot that are not in the first snapshot.
 5. The apparatus of claim1, wherein the control circuitry is configured to select all requestsfrom the first snapshot before selecting any requests from the secondsnapshot that are not in the first snapshot.
 6. The apparatus of claim1, wherein the control circuitry is configured to use weightedround-robin arbitration among the queues prior to detecting the firstthreshold condition and round-robin arbitration among queues forrequests in the first snapshot subsequent to detecting the firstthreshold condition.
 7. The apparatus of claim 1, wherein the control circuitry includes at least three sets of counters configured to maintain information that indicates current numbers of requests in three or more snapshots for respective ones of the queues, wherein the control circuitry is configured to update the counters based on queued requests winning arbitration.
 8. The apparatus of claim 1, wherein one or more of the queues maintain ordering among requests of one or more virtual channels that are shared by multiple client circuits.
 9. The apparatus of claim 1, further comprising: a cache that includes a first portion associated with a first processor sub-unit and a second portion associated with a second processor sub-unit; and router circuitry configured to route cache access requests from the second processor sub-unit to the first portion of the cache; wherein the queue circuitry and control circuitry are included in the router circuitry.
 10. The apparatus of claim 9, wherein the cache includes at least three portions associated with respective processor sub-units and the router circuitry is configured to route cache access requests in at least three directions.
 11. The apparatus of claim 1, wherein the queue circuitry and control circuitry are included in circuitry that routes memory access requests for one or more shader cores of a graphics processor.
 12. A non-transitory computer readable storage medium having stored thereon design information that specifies a design of at least a portion of a hardware integrated circuit in a format recognized by a semiconductor fabrication system that is configured to use the design information to produce the circuit according to the design, wherein the design information specifies that the circuit includes: queue circuitry that implements multiple queues configured to queue requests to access a memory bus; control circuitry configured to: in response to detecting a first threshold condition associated with the queue circuitry, generate a first snapshot that indicates numbers of requests in respective queues of the multiple queues at a first time; generate a second snapshot that indicates numbers of requests in respective queues of the multiple queues at a second time that is subsequent to the first time; and arbitrate between requests from the multiple queues to select requests to access the memory bus, wherein the arbitration is based on snapshots to which requests from the multiple queues belong.
 13. The non-transitory computer readable storage medium of claim 12, wherein the control circuitry is configured to provide higher priority to requests in the first snapshot than to requests in the second snapshot.
 14. The non-transitory computer readable storage medium of claim 12, wherein the control circuitry is configured to select all available requests from the first snapshot before selecting any requests from the second snapshot that are not in the first snapshot.
 15. The non-transitory computer readable storage medium of claim 12, wherein the control circuitry is configured to use weighted round-robin arbitration among the queues prior to detecting the first threshold condition and round-robin arbitration among queues for requests in the first snapshot subsequent to detecting the first threshold condition.
 16. The non-transitory computer readable storage medium of claim 12, wherein the control circuitry includes at least three sets of counters configured to maintain information that indicates current numbers of requests in three or more snapshots for respective ones of the queues, wherein the control circuitry is configured to update the counters based on queued requests winning arbitration.
 17. The non-transitory computer readable storage medium of claim 12, wherein one or more of the queues maintain ordering among requests of one or more virtual channels that are shared by multiple client circuits.
 18. A method, comprising: generating, by control circuitry, a first snapshot that indicates numbers of requests in respective queues of multiple queues at a first time, wherein the generating is performed in response to detecting a first threshold condition associated with queue circuitry that includes the multiple queues; generating, by the control circuitry, a second snapshot that indicates numbers of requests in respective queues of the multiple queues at a second time that is subsequent to the first time; and arbitrating, by the control circuitry, between requests from the multiple queues to select requests to access a memory bus, wherein the arbitrating is based on snapshots to which requests from the multiple queues belong.
 19. The method of claim 18, wherein arbitrating includes selecting all available requests from the first snapshot before selecting any requests from the second snapshot that are not in the first snapshot.
 20. The method of claim 18, wherein the control circuitry uses weighted round-robin arbitration among the queues prior to detecting the first threshold condition and uses round-robin arbitration among queues for requests in the first snapshot subsequent to detecting the first threshold condition.
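
The snapshot arbitration recited in claims 1, 2, and 7 may be illustrated by a minimal software sketch. It is a simplified model under stated assumptions and not the claimed circuitry: the class name SnapshotArbiter, its method names, and the trigger (total queued requests reaching a `threshold` value) are hypothetical; plain round-robin stands in for the weighted round-robin of claim 6; and the credit system of claim 3 is omitted. Each snapshot is modeled as a set of per-queue counters that are decremented when a queued request wins arbitration.

```python
from collections import deque


class SnapshotArbiter:
    """Illustrative model of snapshot-based arbitration (all names hypothetical)."""

    def __init__(self, num_queues, threshold):
        self.queues = [deque() for _ in range(num_queues)]  # FIFO request queues
        self.threshold = threshold  # assumed trigger: total queued requests
        self.snapshots = deque()    # oldest first; each entry is per-queue counters
        self.rr = 0                 # round-robin pointer

    def take_snapshot(self):
        # Record how many requests each queue holds at this moment.
        counts = [len(q) for q in self.queues]
        if any(counts):
            self.snapshots.append(counts)

    def enqueue(self, queue_index, request):
        self.queues[queue_index].append(request)
        total = sum(len(q) for q in self.queues)
        # First threshold condition: generate the first snapshot.
        if not self.snapshots and total >= self.threshold:
            self.take_snapshot()

    def select(self):
        # Retire snapshots whose requests have all been selected.
        while self.snapshots and not any(self.snapshots[0]):
            self.snapshots.popleft()
        active = self.snapshots[0] if self.snapshots else None
        n = len(self.queues)
        for i in range(n):
            q = (self.rr + i) % n
            if active is not None:
                eligible = active[q] > 0          # head belongs to oldest snapshot
            else:
                eligible = len(self.queues[q]) > 0
            if eligible:
                # Queues are FIFO, so the winner is the oldest request in q and
                # belongs to every snapshot that still counts requests for q.
                for snap in self.snapshots:
                    if snap[q] > 0:
                        snap[q] -= 1
                self.rr = (q + 1) % n
                return q, self.queues[q].popleft()
        return None  # nothing to select
```

As an example, consider two queues and a threshold of three. Enqueuing a0 and a1 to queue 0 and b0 to queue 1 triggers the first snapshot; enqueuing b1 and b2 to queue 1, calling take_snapshot() to form the second snapshot, and then enqueuing a2 to queue 0 yields the selection order a0, b0, a1, b1, b2, a2. Request a2 is selected last even though round-robin alone would have favored queue 0, which approximates age-based ordering using only per-queue counters.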